# Vision-Language Joint Modeling

Vica2 Stage2 Onevision Ft

ViCA2 is a 7B-parameter multimodal vision-language model focused on video understanding and visual-spatial cognition tasks.

Transformers English

Ret CLIP ViT L 14

ReT is a method for multimodal query and document retrieval. It achieves fine-grained retrieval by fusing multi-level representations from the vision and text backbone networks.

Multimodal Fusion

Image Captioning Model

A model that combines a Vision Transformer (ViT) with natural language processing to automatically generate natural language descriptions of input images.

Paligemma Multimodal Query Rewrite

A multimodal query rewrite model fine-tuned from google/paligemma-3b-pt-224.

Llava V1.6 Vicuna 7b

LLaVA is an open-source multimodal chatbot, built by fine-tuning a large language model on multimodal instruction-following data.

LLaVA is a large multimodal model that achieves general-purpose visual assistant capabilities by connecting a visual encoder to a large language model.
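As a rough illustration of what "connecting a visual encoder with a large language model" can mean, the sketch below projects image patch features into the language model's embedding space and prepends them to the text token embeddings. It assumes a single linear projection and toy dimensions; the function names, sizes, and values are hypothetical, and the actual LLaVA connector may differ.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def project(features: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Map each visual feature vector into the language model's embedding space."""
    projected = []
    for vec in features:
        projected.append([
            sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)
        ])
    return projected


def build_input_sequence(image_features: Matrix, text_embeddings: Matrix,
                         weight: Matrix, bias: Vector) -> Matrix:
    """Prepend projected image tokens to the text token embeddings."""
    return project(image_features, weight, bias) + text_embeddings


# Hypothetical toy sizes: 2 image patches of dimension 3, embedding dimension 2.
image_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # shape (2, 3), assumed values
bias = [0.0, 0.5]
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

sequence = build_input_sequence(image_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of dimension 2
```

The language model then processes the combined sequence as if the projected image patches were ordinary tokens.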

Filmtitle Beit GPT2

A Chinese movie poster title generation model built on a BEiT visual encoder and a GPT2 text decoder.

Transformers Chinese

MatCha is a pre-trained model that improves the ability of vision-language models to process chart and language data, and it excels at chart question answering tasks.

Transformers Supports Multiple Languages

Matcha Chart2text Pew

MatCha is a vision-language model based on the Pix2Struct architecture, optimized for chart comprehension and numerical reasoning tasks, and it excels at chart-based question answering.

Transformers Supports Multiple Languages

© 2025 AIbase